A quick overview of the dataset will make the later steps easier to follow.
dim(USvideos)
[1] 40949 16
str(USvideos)
Classes ‘spec_tbl_df’, ‘tbl_df’, ‘tbl’ and 'data.frame': 40949 obs. of 16 variables:
$ video_id : chr "2kyS6SvSYSE" "1ZAPwfrtAFY" "5qpjK5DgCt4" "puqaWrEC7tY" ...
$ trending_date : chr "17.14.11" "17.14.11" "17.14.11" "17.14.11" ...
$ title : chr "WE WANT TO TALK ABOUT OUR MARRIAGE" "The Trump Presidency: Last Week Tonight with John Oliver (HBO)" "Racist Superman | Rudy Mancuso, King Bach & Lele Pons" "Nickelback Lyrics: Real or Fake?" ...
$ channel_title : chr "CaseyNeistat" "LastWeekTonight" "Rudy Mancuso" "Good Mythical Morning" ...
$ category_id : num 22 24 23 24 24 28 24 28 1 25 ...
$ publish_time : POSIXct, format: "2017-11-13 17:13:01" "2017-11-13 07:30:00" "2017-11-12 19:05:24" "2017-11-13 11:00:04" ...
$ tags : chr "SHANtell martin" "last week tonight trump presidency\"|\"last week tonight donald trump\"|\"john oliver trump\"|\"donald trump" "racist superman\"|\"rudy\"|\"mancuso\"|\"king\"|\"bach\"|\"racist\"|\"superman\"|\"love\"|\"rudy mancuso poo be"| __truncated__ "rhett and link\"|\"gmm\"|\"good mythical morning\"|\"rhett and link good mythical morning\"|\"good mythical mor"| __truncated__ ...
$ views : num 748374 2418783 3191434 343168 2095731 ...
$ likes : num 57527 97185 146033 10172 132235 ...
$ dislikes : num 2966 6146 5339 666 1989 ...
$ comment_count : num 15954 12703 8181 2146 17518 ...
$ thumbnail_link : chr "https://i.ytimg.com/vi/2kyS6SvSYSE/default.jpg" "https://i.ytimg.com/vi/1ZAPwfrtAFY/default.jpg" "https://i.ytimg.com/vi/5qpjK5DgCt4/default.jpg" "https://i.ytimg.com/vi/puqaWrEC7tY/default.jpg" ...
$ comments_disabled : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
$ ratings_disabled : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
$ video_error_or_removed: logi FALSE FALSE FALSE FALSE FALSE FALSE ...
$ description : chr "SHANTELL'S CHANNEL - https://www.youtube.com/shantellmartin\\nCANDICE - https://www.lovebilly.com\\n\\nfilmed t"| __truncated__ "One year after the presidential election, John Oliver discusses what we've learned so far and enlists our cathe"| __truncated__ "WATCH MY PREVIOUS VIDEO ▶ \\n\\nSUBSCRIBE ► https://www.youtube.com/channel/UC5jkXpfnBhlDjqh0ir5FsIQ?sub_confir"| __truncated__ "Today we find out if Link is a Nickelback amateur or a secret Nickelback devotee. GMM #1218\\nDon't miss an all"| __truncated__ ...
- attr(*, "problems")=Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 1533544 obs. of 5 variables:
..$ row : int 2 2 2 2 2 2 3 3 3 3 ...
..$ col : chr "tags" "tags" "tags" "tags" ...
..$ expected: chr "delimiter or quote" "delimiter or quote" "delimiter or quote" "delimiter or quote" ...
..$ actual : chr "|" "l" "|" "j" ...
..$ file : chr "'data/USvideos.csv'" "'data/USvideos.csv'" "'data/USvideos.csv'" "'data/USvideos.csv'" ...
- attr(*, "spec")=
.. cols(
.. video_id = col_character(),
.. trending_date = col_character(),
.. title = col_character(),
.. channel_title = col_character(),
.. category_id = col_double(),
.. publish_time = col_datetime(format = ""),
.. tags = col_character(),
.. views = col_double(),
.. likes = col_double(),
.. dislikes = col_double(),
.. comment_count = col_double(),
.. thumbnail_link = col_character(),
.. comments_disabled = col_logical(),
.. ratings_disabled = col_logical(),
.. video_error_or_removed = col_logical(),
.. description = col_character()
.. )
Next, we check the dataset for outliers and mistakes.
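First, test the column called “category_id”. There are 43 categories, so no value in the column should be larger than 43 or smaller than 1.
assert(data = USvideos, in_set(1, 43, allow.na = FALSE), category_id)
Column 'category_id' violates assertion 'in_set(1, 43, allow.na = FALSE)' 38547 times
[omitted 38542 rows]
Error: assertr stopped execution
This result is misleading: in_set(1, 43) tests membership in the two-element set {1, 43}, not the range from 1 to 43, so valid IDs such as 22 or 24 are counted as violations. A range check needs within_bounds(1, 43, allow.na = FALSE) or in_set(1:43, allow.na = FALSE) instead. The output prints only the first 5 of the 38547 flagged rows, so it does not tell us how many NA values the column contains; rows with NA are removed later.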
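To see how in_set(1, 43) differs from a range check such as within_bounds(1, 43), both predicates can be applied to a few made-up IDs. This is a minimal sketch, assuming the assertr package is installed; the vector ids is hypothetical and not taken from the dataset.

```r
library(assertr)

# Hypothetical category IDs, including one out-of-range value and one NA.
ids <- c(1, 22, 24, 43, 44, NA)

# in_set(1, 43) builds a predicate for the set {1, 43} only.
set_check <- in_set(1, 43, allow.na = FALSE)(ids)

# within_bounds(1, 43) builds a predicate for the closed range [1, 43].
range_check <- within_bounds(1, 43, allow.na = FALSE)(ids)

print(data.frame(ids, set_check, range_check))
```

Only the range predicate accepts 22 and 24; both reject 44 and NA because allow.na is FALSE.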
The numeric count columns (views, likes, dislikes, comment_count) cannot be negative in reality, so every value should be at least 0.
assert(data = USvideos, within_bounds(lower.bound = 0, upper.bound = Inf, allow.na = FALSE), views)
assert(data = USvideos, within_bounds(lower.bound = 0, upper.bound = Inf, allow.na = FALSE), likes)
assert(data = USvideos, within_bounds(lower.bound = 0, upper.bound = Inf, allow.na = FALSE), dislikes)
assert(data = USvideos, within_bounds(lower.bound = 0, upper.bound = Inf, allow.na = FALSE), comment_count)
Fortunately, none of the values are negative, so these columns contain no such mistakes.
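Because assert() accepts several column names after the predicate, the four bound checks can also be written as a single call. The sketch below uses a small made-up data frame (toy is hypothetical) in place of USvideos and assumes assertr is installed.

```r
library(assertr)

# Toy stand-in for USvideos; the values are made up for illustration.
toy <- data.frame(
  views         = c(748374, 2418783, 0),
  likes         = c(57527, 97185, 12),
  dislikes      = c(2966, 6146, 0),
  comment_count = c(15954, 12703, 3)
)

# One assert() call applies the same predicate to every listed column.
# If any value fails, assert() stops with an error; otherwise it returns
# the data, so the call can sit in the middle of a pipeline.
checked <- assert(
  toy,
  within_bounds(0, Inf, allow.na = FALSE),
  views, likes, dislikes, comment_count
)
```

Keeping one call per column, as in the original, makes it easier to see which column failed; the combined form is simply shorter.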
For the logical columns, every value should be TRUE or FALSE.
assert(data = USvideos, in_set(TRUE, FALSE, allow.na = FALSE), comments_disabled)
assert(data = USvideos, in_set(TRUE, FALSE, allow.na = FALSE), ratings_disabled)
assert(data = USvideos, in_set(TRUE, FALSE, allow.na = FALSE), video_error_or_removed)
These checks pass without errors as well.
Only a small share of the observations contain NA values (578 of 40949 rows, about 1.4%), so we can simply remove every row that has an NA.
USvideos_NNA <- as.data.frame(na.omit(USvideos))
USvideos_NNA
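Before dropping rows, it can help to count where the missing values are. The following base R sketch uses a made-up data frame (toy is hypothetical); the same calls would apply to USvideos.

```r
# Toy data frame with a few missing values; made up for illustration.
toy <- data.frame(
  category_id = c(22, NA, 24, 28),
  views       = c(100, 200, NA, 400),
  description = c("a", NA, "c", "d"),
  stringsAsFactors = FALSE
)

# Number of NA values in each column.
na_per_column <- colSums(is.na(toy))

# Number of rows with at least one NA, i.e. the rows na.omit() will drop.
rows_dropped <- sum(!complete.cases(toy))

# na.omit() keeps only the complete rows.
toy_clean <- na.omit(toy)

print(na_per_column)
print(rows_dropped)
```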
Next, we convert the “trending_date” column from character to Date with ydm() from the “lubridate” package. The values are stored as year.day.month, so "17.14.11" means 14 November 2017.
USvideos_NNA <- USvideos_NNA %>%
  mutate(trending_date = ydm(trending_date))
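As a quick check of the year.day.month parsing, ydm() can be tried on a couple of sample strings first. This is a minimal sketch, assuming lubridate is installed; the second string is made up for illustration.

```r
library(lubridate)

# Sample strings in the same year.day.month layout as trending_date.
raw <- c("17.14.11", "18.01.06")

# ydm() reads the three fields as year, day, month and returns Date values.
parsed <- ydm(raw)

print(parsed)
```

The first value comes out as 2017-11-14, which matches the converted column in the structure printed afterwards.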
Now let’s look at the structure of the dataset again.
str(USvideos_NNA)
'data.frame': 40371 obs. of 16 variables:
$ video_id : chr "2kyS6SvSYSE" "1ZAPwfrtAFY" "5qpjK5DgCt4" "puqaWrEC7tY" ...
$ trending_date : Date, format: "2017-11-14" "2017-11-14" "2017-11-14" "2017-11-14" ...
$ title : chr "WE WANT TO TALK ABOUT OUR MARRIAGE" "The Trump Presidency: Last Week Tonight with John Oliver (HBO)" "Racist Superman | Rudy Mancuso, King Bach & Lele Pons" "Nickelback Lyrics: Real or Fake?" ...
$ channel_title : chr "CaseyNeistat" "LastWeekTonight" "Rudy Mancuso" "Good Mythical Morning" ...
$ category_id : num 22 24 23 24 24 28 24 28 1 25 ...
$ publish_time : POSIXct, format: "2017-11-13 17:13:01" "2017-11-13 07:30:00" "2017-11-12 19:05:24" "2017-11-13 11:00:04" ...
$ tags : chr "SHANtell martin" "last week tonight trump presidency\"|\"last week tonight donald trump\"|\"john oliver trump\"|\"donald trump" "racist superman\"|\"rudy\"|\"mancuso\"|\"king\"|\"bach\"|\"racist\"|\"superman\"|\"love\"|\"rudy mancuso poo be"| __truncated__ "rhett and link\"|\"gmm\"|\"good mythical morning\"|\"rhett and link good mythical morning\"|\"good mythical mor"| __truncated__ ...
$ views : num 748374 2418783 3191434 343168 2095731 ...
$ likes : num 57527 97185 146033 10172 132235 ...
$ dislikes : num 2966 6146 5339 666 1989 ...
$ comment_count : num 15954 12703 8181 2146 17518 ...
$ thumbnail_link : chr "https://i.ytimg.com/vi/2kyS6SvSYSE/default.jpg" "https://i.ytimg.com/vi/1ZAPwfrtAFY/default.jpg" "https://i.ytimg.com/vi/5qpjK5DgCt4/default.jpg" "https://i.ytimg.com/vi/puqaWrEC7tY/default.jpg" ...
$ comments_disabled : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
$ ratings_disabled : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
$ video_error_or_removed: logi FALSE FALSE FALSE FALSE FALSE FALSE ...
$ description : chr "SHANTELL'S CHANNEL - https://www.youtube.com/shantellmartin\\nCANDICE - https://www.lovebilly.com\\n\\nfilmed t"| __truncated__ "One year after the presidential election, John Oliver discusses what we've learned so far and enlists our cathe"| __truncated__ "WATCH MY PREVIOUS VIDEO ▶ \\n\\nSUBSCRIBE ► https://www.youtube.com/channel/UC5jkXpfnBhlDjqh0ir5FsIQ?sub_confir"| __truncated__ "Today we find out if Link is a Nickelback amateur or a secret Nickelback devotee. GMM #1218\\nDon't miss an all"| __truncated__ ...